
    A Scalable Parallel Architecture with FPGA-Based Network Processor for Scientific Computing

    This thesis discusses the design and implementation of an FPGA-based Network Processor (NWP) for scientific computing applications such as Lattice Quantum Chromodynamics (LQCD) and fluid dynamics based on Lattice Boltzmann Methods (LBM). State-of-the-art programs in this (and other similar) application areas expose a large degree of parallelism that can be easily exploited on massively parallel systems, provided the underlying communication network offers not only high bandwidth but also low latency. I have designed in detail, and built and tested in hardware, firmware and software, an implementation of a Network Processor tailored to the most recent families of multi-core processors. The implementation was developed on an FPGA device to easily interface the NWP logic with the CPU I/O sub-system. In this work I have assessed several ways to move data between the main memory of the CPU and the I/O sub-system so as to obtain high data throughput and low latency, using “Programmed Input Output” (PIO), “Direct Memory Access” (DMA) and “Write Combining” memory settings. On the software side, I developed and tested a device driver for the Linux operating system to access the NWP device, as well as a system library to efficiently access the network device from user applications. This thesis demonstrates the feasibility of a network infrastructure that saturates the maximum bandwidth of the I/O sub-systems available on recent CPUs, and reduces communication latencies to values very close to those needed by the processor to move data across the chip boundary.
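
    As an illustration of the PIO/write-combining access path discussed above, the sketch below maps a device window from user space and copies a payload into it before ringing a doorbell. The device node /dev/nwp0, the window size, the doorbell offset and the write-combining attribute of the mapping are all assumptions made for the example, not the actual driver interface described in the thesis.

```c
/* Illustrative sketch only: device node, offsets and semantics are assumed,
 * not taken from the actual NWP driver. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TX_BUF_SIZE 4096   /* assumed size of the PIO transmit window */

int main(void)
{
    int fd = open("/dev/nwp0", O_RDWR);          /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* The driver is assumed to expose the transmit window with a
     * write-combining mapping, so consecutive stores can be merged into
     * full-width PCIe write bursts. */
    volatile uint64_t *tx = mmap(NULL, TX_BUF_SIZE, PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (tx == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    uint64_t payload[16];
    memset(payload, 0xab, sizeof(payload));

    /* PIO: the CPU itself copies the payload into the device window. */
    for (int i = 0; i < 16; i++)
        tx[i] = payload[i];
    __sync_synchronize();   /* drain write-combining buffers before the "doorbell" */
    tx[511] = 1;            /* assumed doorbell location */

    munmap((void *)tx, TX_BUF_SIZE);
    close(fd);
    return 0;
}
```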

    An FPGA-based Torus Communication Network

    We describe the design and FPGA implementation of a 3D torus network (TNW) to provide nearest-neighbor communications between commodity multi-core processors. The aim of this project is to build tightly interconnected and scalable parallel systems for scientific computing. The design includes the VHDL code to implement, on the latest FPGA devices, a network processor which can be accessed by the CPU through a PCIe interface and which controls the external PHYs of the physical links. Moreover, a Linux driver and a library implementing custom communication APIs are provided. The TNW has been successfully integrated in two recent parallel machine projects, QPACE and AuroraScience. We describe some details of the porting of the TNW for the AuroraScience system and report performance results. Comment: 7 pages, 3 figures; proceedings of the XXVIII International Symposium on Lattice Field Theory, Lattice2010, June 14-19, 2010, Villasimius, Sardinia, Italy.
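
    The abstract does not spell out the custom communication API; the sketch below only illustrates what a nearest-neighbour halo exchange over such a 3D torus could look like, with invented names (tnw_send, tnw_recv, direction constants) and trivial stubs standing in for the real driver-backed library.

```c
/* Hypothetical nearest-neighbour exchange on a 3D torus; names, direction
 * constants and semantics are assumptions for illustration only. */
#include <stddef.h>

typedef enum { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS } tnw_dir_t;

/* Stubs standing in for the driver-backed library. */
static int tnw_send(tnw_dir_t dir, const void *buf, size_t len)
{ (void)dir; (void)buf; (void)len; return 0; }
static int tnw_recv(tnw_dir_t dir, void *buf, size_t len)
{ (void)dir; (void)buf; (void)len; return 0; }

/* Exchange halo planes with both neighbours along X, as an LB code would
 * do once per time step. */
static int exchange_x(double *send_lo, double *send_hi,
                      double *recv_lo, double *recv_hi, size_t n)
{
    size_t bytes = n * sizeof(double);
    if (tnw_send(XMINUS, send_lo, bytes)) return -1;
    if (tnw_send(XPLUS,  send_hi, bytes)) return -1;
    if (tnw_recv(XPLUS,  recv_hi, bytes)) return -1;
    if (tnw_recv(XMINUS, recv_lo, bytes)) return -1;
    return 0;
}

int main(void)
{
    enum { HALO = 256 };
    double slo[HALO], shi[HALO], rlo[HALO], rhi[HALO];
    return exchange_x(slo, shi, rlo, rhi, HALO);
}
```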

    Externalities and the nucleolus

    In most economic applications, externalities prevail: the worth of a coalition depends on how the other players are organized. We show that there is a unique natural way of extending the nucleolus from (coalitional) games without externalities to games with externalities. This is in contrast to the Shapley value and the core, for which many different extensions have been proposed.
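
    As background (the standard notion the paper extends, not the paper's own construction): for a coalitional game (N, v) without externalities, the nucleolus is defined through the coalition excesses, recalled below.

```latex
% Background: excess of coalition S at allocation x in a TU game (N, v)
% without externalities (standard definition, not the paper's extension).
e(S, x) \;=\; v(S) \;-\; \sum_{i \in S} x_i , \qquad S \subseteq N .
% The nucleolus is the imputation x whose vector of excesses
%   \theta(x) = \big( e(S_1, x) \ge e(S_2, x) \ge \dots \big),
% sorted in non-increasing order, is lexicographically minimal.
```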

    Performance issues on many-core processors: A D2Q37 Lattice Boltzmann scheme as a test-case

    Performance on recent processor architectures relies heavily on the ability of applications and compilers to exploit an increasingly diverse and large set of parallel features. In this paper we focus on issues related to the efficient programming of multi-core processors based on the Sandy Bridge micro-architecture recently introduced by Intel. As a test case we use a D2Q37 Lattice Boltzmann algorithm, which accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equations of state of a perfect gas. The regular structure and the high degree of parallelism available in this class of applications make it relatively easy to exploit several processor features relevant for performance, such as, for example, the new Advanced Vector Extensions (AVX) SIMD instruction set. However, the main challenge is how to efficiently map the application onto the hardware structure of the processor. In this paper we present the implementation of our Lattice Boltzmann code on the Sandy Bridge processor, and assess the efficiency of several programming strategies and data-structure organizations, both in terms of memory access and computing performance. We also compare our results with those obtained on previous-generation Intel processors and on recent NVIDIA GP-GPU computing systems.
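
    One concrete instance of the data-structure organizations mentioned above is the usual array-of-structures versus structure-of-arrays choice for the population arrays; the snippet below is a generic illustration with assumed sizes, not the authors' code.

```c
/* Generic AoS vs SoA layout for a D2Q37 lattice; illustrative only. */
#include <stdlib.h>

#define NPOP   37        /* populations per lattice site in D2Q37 */
#define NSITES (1 << 18) /* assumed lattice size for the example */

/* Array of Structures: the populations of one site are contiguous.
 * Convenient for the collision step, but propagation accesses become
 * strided and hard to vectorize with AVX. */
typedef struct { double p[NPOP]; } site_aos_t;

/* Structure of Arrays: each population index is a contiguous array,
 * so the same population of consecutive sites can be loaded as one
 * AVX vector. */
typedef struct { double *p[NPOP]; } lattice_soa_t;

static lattice_soa_t soa_alloc(void)
{
    lattice_soa_t l;
    for (int k = 0; k < NPOP; k++)
        l.p[k] = aligned_alloc(32, NSITES * sizeof(double)); /* 32 B = AVX width */
    return l;
}

int main(void)
{
    site_aos_t *aos = malloc(NSITES * sizeof(*aos)); /* AoS allocation */
    lattice_soa_t soa = soa_alloc();                 /* SoA allocation */
    /* ... collision/propagation kernels would operate on these layouts ... */
    for (int k = 0; k < NPOP; k++) free(soa.p[k]);
    free(aos);
    return 0;
}
```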

    Appropriateness of repeated execution of laboratory examinations: a CDSS approach

    Repetitive laboratory testing has become a well-recognized problem in the practice of medicine, especially in the hospital inpatient setting, since it increases costs and causes patient discomfort. Among the interventions proposed to reduce unnecessary testing, Clinical Decision Support Systems (CDSS) have been shown to be effective. We present the design of a CDSS that advises professionals in real time on the appropriateness of repeating laboratory exams, embedded in the Computerized Physician Order Entry system at the Azienda Ospedaliero-Universitaria and Azienda Unità Sanitaria Locale of Ferrara, Italy. Test-specific time intervals within which test repetition is redundant are encoded in formal rules, which are applied to the laboratory results previously obtained for the patient. The implemented rule set covers clinical chemistry, hematology, coagulation, infectious disease serology, microbiology, inflammation, cardiac and tumor markers, hormones, autoimmunity, allergology, molecular biology and drug-monitoring testing.
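
    A minimal sketch of how such a test-specific time-interval rule could be encoded and checked; the test code and the redundancy window below are illustrative assumptions, not the rules actually deployed at Ferrara.

```c
/* Sketch of a minimum-retest-interval rule check; the test code and the
 * 30-day threshold are illustrative, not the deployed rule set. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    const char *test_code;     /* e.g. a local laboratory test code */
    double min_interval_hours; /* repetition earlier than this is redundant */
} retest_rule_t;

/* Returns true when the new order falls inside the redundancy window. */
static bool is_redundant(const retest_rule_t *rule,
                         time_t last_result, time_t new_order)
{
    double elapsed_hours = difftime(new_order, last_result) / 3600.0;
    return elapsed_hours < rule->min_interval_hours;
}

int main(void)
{
    retest_rule_t hba1c = { "HbA1c", 24.0 * 30 };  /* assumed 30-day window */
    time_t last = time(NULL) - 10 * 24 * 3600;     /* previous result 10 days ago */
    time_t now  = time(NULL);
    if (is_redundant(&hba1c, last, now))
        printf("Alert: repeat %s likely redundant\n", hba1c.test_code);
    return 0;
}
```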

    Optimization of Multi-Phase Compressible Lattice Boltzmann Codes on Massively Parallel Multi-Core Systems

    Abstract: We develop a Lattice Boltzmann code for computational fluid dynamics and optimize it for massively parallel systems based on multi-core processors. Our code describes 2D multi-phase compressible flows. We analyze the performance bottlenecks that we find as we gradually expose a larger fraction of the available parallelism, and derive appropriate solutions. We obtain a sustained performance for this ready-for-physics code that is a large fraction of peak. Our results can be easily applied to most present (or planned) HPC architectures based on the latest generation of multi-core Intel processors.

    Keywords: computational fluid dynamics, Lattice Boltzmann methods, multi-core processors

    Overview: Fluid dynamics critically relies on computational techniques to compute reliable solutions to the highly nonlinear equations of motion in regimes interesting for physics or engineering. Over the years, many different numerical approaches have been theoretically developed and implemented on state-of-the-art massively parallel computers. The Lattice Boltzmann (LB) method is a flexible approach, able to cope with many different fluid equations (e.g., multi-phase, multi-component and thermal fluids) and to handle complex geometries or boundary conditions. LB builds on the fact that the details of the interaction among the fluid components at the microscopic level do not change the structure of the equations of motion at the macroscopic level, but only modulate the values of their parameters. LB therefore describes on the computer a simple synthetic dynamics of fictitious particles that evolve explicitly in time and, appropriately averaged, provide the correct values of the macroscopic quantities of the flow; see [1] for a complete introduction. The main advantage of LB schemes from the point of view of an efficient parallel implementation is that they are local (that is, they do not require the computation of non-local fields, such as pressure). However, in recent years the processing nodes themselves have included more and more parallel features, such as many-core structures and/or vectorized data paths: the challenge now rests in combining inter-node and intra-node parallelism effectively. In this paper we report on a high-efficiency implementation of LB on a massively parallel system whose compute nodes are themselves state-of-the-art multi-core CPUs. This paper builds on previous work that addressed the same […]
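
    A minimal sketch of the intra-node side of the parallelization described above: OpenMP threads over lattice sites with a SIMD-friendly inner loop over populations. The kernel body is a placeholder and the sizes are assumptions; this is not the authors' optimized code.

```c
/* Illustrative intra-node parallelization of an LB update step.
 * Compile with -fopenmp; the collision operator is a placeholder. */
#include <omp.h>
#include <stdlib.h>

#define NPOP   37        /* populations per site (as in D2Q37); assumed here */
#define NSITES (1 << 18) /* assumed lattice size for the example */

static void collide_site(double *restrict out, const double *restrict in)
{
    /* Placeholder for the actual multi-phase collision operator;
     * the compiler can vectorize this short, stride-1 loop. */
    for (int k = 0; k < NPOP; k++)
        out[k] = in[k];
}

void lb_step(double (*restrict f_new)[NPOP], double (*restrict f_old)[NPOP])
{
    /* Thread-level parallelism across lattice sites. */
    #pragma omp parallel for schedule(static)
    for (long s = 0; s < NSITES; s++)
        collide_site(f_new[s], f_old[s]);
}

int main(void)
{
    double (*f_old)[NPOP] = calloc(NSITES, sizeof(*f_old));
    double (*f_new)[NPOP] = calloc(NSITES, sizeof(*f_new));
    lb_step(f_new, f_old);   /* one (placeholder) time step */
    free(f_old);
    free(f_new);
    return 0;
}
```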

    Front propagation in Rayleigh-Taylor systems with reaction

    A special feature of Rayleigh-Taylor systems with chemical reactions is the competition between turbulent mixing and the "burning" processes, which leads to a highly non-trivial dynamics. We studied the problem by performing high-resolution numerical simulations of a 2D system, using a thermal lattice Boltzmann (LB) model. We spanned the various regimes that emerge as the relative chemical/turbulent time scales change, from slow to fast reaction; in the former case we found numerical evidence of an enhancement of the front propagation speed (with respect to the laminar case), and we provide a phenomenological argument to explain the observed behaviour. When the reaction is very fast, instead, the formation of sharp fronts separating patches of pure phases leads to an increase of intermittency in the small-scale statistics of the temperature field.
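
    For context, a commonly used setting for this kind of study couples an advected temperature field to an FKPP-type reaction term; the equations below are a hedged reconstruction of that standard model (the paper's exact formulation may differ), with the chemical time scale setting the reaction rate.

```latex
% Assumed reference model (the paper's exact formulation may differ):
% temperature advected by the flow with an FKPP-type reaction term,
% \tau being the chemical time scale and \kappa the diffusivity.
\partial_t T \;+\; \mathbf{u}\cdot\nabla T
    \;=\; \kappa\,\nabla^{2} T \;+\; \frac{1}{\tau}\, T\,(1 - T)
% In the laminar (no-flow) limit, fronts of this reaction-diffusion
% equation propagate at the FKPP speed
%   v_0 = 2\sqrt{\kappa/\tau},
% the natural reference against which turbulence-enhanced propagation
% is measured.
```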

    An FPGA-based supercomputer for statistical physics: the weird case of Janus

    In this chapter we describe the Janus supercomputer, a massively parallel FPGA-based system optimized for the simulation of spin glasses, theoretical models that describe the behavior of glassy materials. The custom architecture of Janus has been developed to meet the computational requirements of these models. Spin-glass simulations are performed using Monte Carlo methods that lead to algorithms characterized by: (1) intrinsic parallelism, allowing us to implement many Monte Carlo update engines within a single FPGA; (2) a rather small data set (2 MByte) that can be stored on-chip, significantly boosting bandwidth and reducing latency; (3) the need to generate a large number of good-quality, long (≥ 32 bit) random numbers; (4) mostly integer arithmetic and bitwise logic operations. Careful tailoring of the architecture to the specific features of these algorithms has allowed us to embed up to 1024 special-purpose cores within just one FPGA, so that simulations of systems that would take centuries on conventional architectures can be performed in just a few months.
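
    Points (1)-(4) can be made concrete with a textbook Metropolis kernel for an Edwards-Anderson spin glass, written with ±1 integer spins and couplings and a 32-bit RNG; this is a generic illustration, not the Janus update engine.

```c
/* Textbook Metropolis update for a 2D Edwards-Anderson spin glass with
 * ±1 spins and couplings and a 32-bit xorshift RNG; compile with -lm.
 * Generic illustration only, not the Janus update engine. */
#include <math.h>
#include <stdint.h>

#define L 64                      /* lattice side; state easily fits on-chip */

static int8_t spin[L][L];         /* ±1 spins */
static int8_t jr[L][L], jd[L][L]; /* ±1 couplings to right/down neighbours */
static uint32_t rng_state = 0x12345678u;

static uint32_t xorshift32(void)  /* 32-bit pseudo-random numbers */
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

static void metropolis_site(int x, int y, double beta)
{
    int xp = (x + 1) % L, xm = (x + L - 1) % L;
    int yp = (y + 1) % L, ym = (y + L - 1) % L;

    /* Local field: only integer arithmetic on ±1 values. */
    int h = jr[x][y]  * spin[xp][y] + jr[xm][y] * spin[xm][y]
          + jd[x][y]  * spin[x][yp] + jd[x][ym] * spin[x][ym];
    int dE = 2 * spin[x][y] * h;   /* energy change if the spin is flipped */

    /* Accept with probability min(1, exp(-beta*dE)), compared against
     * a 32-bit uniform random number. */
    if (dE <= 0 ||
        xorshift32() < (uint32_t)(exp(-beta * dE) * 4294967296.0))
        spin[x][y] = -spin[x][y];
}

int main(void)
{
    for (int x = 0; x < L; x++)
        for (int y = 0; y < L; y++) {
            spin[x][y] = (xorshift32() & 1) ? 1 : -1;
            jr[x][y]   = (xorshift32() & 1) ? 1 : -1;
            jd[x][y]   = (xorshift32() & 1) ? 1 : -1;
        }
    for (int sweep = 0; sweep < 100; sweep++)
        for (int x = 0; x < L; x++)
            for (int y = 0; y < L; y++)
                metropolis_site(x, y, 0.5 /* beta */);
    return 0;
}
```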

    Janus2: An FPGA-based supercomputer for spin glass simulations

    We describe the past and future of the Janus project. The collaboration started in 2006 and deployed the Janus supercomputer in early 2008, a facility that made it possible to speed up Monte Carlo simulations of a class of model glassy systems and provided unprecedented results for some paradigms of Statistical Mechanics. The Janus supercomputer was based on state-of-the-art FPGA technology and provided almost two orders of magnitude of improvement in cost/performance and power/performance ratios. More than four years later, commercial systems are closing the gap in terms of performance, but FPGA technology has also largely improved. A new-generation supercomputer, Janus2, will improve on its predecessor by more than one order of magnitude, and will accordingly again be the best choice, with respect to commercial solutions, for Monte Carlo simulations of spin glasses for several years to come. © 2012 ACM